Replication and Distribution Methods for Future TACS Distributed Databases

William  Perrizo  and  Donald  Varvel 

Efficient methods for distributing and replicating data are needed in
a distributed information system serving well-structured military C²
functions where the loss of elements must be anticipated. In future
Tactical Air Control Systems, data entering at various points must be
replicated to the sites where it is to be used. We recognize also that
the data distribution and replication issue must be separated from the
issue of functional backup site designation for C²-elements. Functional
backup site designation should be a dynamic process under the direct
control of the commander. We will consider methods of data distribution
and replication based on the following principles:

PRINCIPLE 1

All data needed by a site to do its current C² function or any future
functional assignment should be sent to that site as soon as possible
after it enters the system.

PRINCIPLE 2

The automated data distribution and replication method should in no way
limit the commander's flexibility with respect to C²-function assignments.
The data replication site issue and the C²-element backup site issue will
be kept separate.


In  this  section,  we  begin  with  a  mathematical  model  of  distribution 
and  replication  methods.  We  will  use  the  phrase  replication  configuration 
to  mean  a  choice  of  data  backup  sites  (primary,  secondary,  ...)  for  each 
site  in  the  system.  We  will  assume  the  data  backup  chain  for  a  given  site 
consists  of  an  ordering  or  chaining  of  all  other  sites  in  the  system, 
where  the  first  site  in  the  chain  is  designated  as  the  primary  data  backup 
site,  the  second  is  designated  as  the  secondary  backup  site,  etc.  We 
recognize  that  it  may  in  some  instances  be  unnecessary  to  have  such  a 
complete  chain  of  data  backup  sites.  The  model  accommodates  such 
instances.  The  full  chain  can  be  truncated  at  the  appropriate  depth. 
However,  even  in  the  instances  where  full  data  backup  seems  unnecessary, 
if  all  data  is  prioritized,  communication  network  idle  time  can  be 
utilized  to  extend  the  data  backup  chain  without  causing  adverse  effects 
on  the  system.  For  purposes  of  this  analysis  then,  we  will  assume  a  full 
data  backup  chain  is  to  be  established  for  each  site. 

It may be desirable to have several sites designated as co-primary
backups (or co-secondary backups, ...) rather than just one (keeping
principle 2 in mind). To allow for this generality, we consider the chain
of data backups for a site, a, in a system with sites N = {a,b,...} to be a
sequence of subsets N_{1,a}, N_{2,a}, ..., where N_{1,a} is the set of
primary data backups for a, N_{2,a} is the set of secondary data backups
for a, etc.

In  addition  to  the  replication  configuration  method,  we  envision  a 
priority  assignment  method  which  will  assist  the  network  in  queueing  the 
data  for  transport.  The  details  of  this  priority  method  will  not  be 
treated  here,  except  to  say  that  data  being  replicated  to  primary  data 
backup  sites  is  high  priority,  while  data  being  replicated  to  secondary 
data backup sites is somewhat lower priority, and so on; high priority data
is pre-emptive in communication network queues. Also, priorities will
depend on the nature of the data as well as the destination.
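
The report leaves the priority method unspecified. Purely as an illustration, the sketch below orders outgoing data by backup level so that primary-backup traffic leaves the queue first. The class name TransportQueue, its methods, and the use of the backup level alone as the priority are assumptions made for this example; a real scheme would also weigh the nature of the data, and true pre-emption of a transfer already in progress is not modelled.

```python
import heapq
from itertools import count

class TransportQueue:
    """Hypothetical send queue: lower backup level means higher priority."""

    def __init__(self):
        self._heap = []
        self._arrival = count()  # tie-breaker keeps first-in, first-out order

    def put(self, backup_level, payload):
        # backup_level 1 = primary backup site, 2 = secondary, and so on.
        heapq.heappush(self._heap, (backup_level, next(self._arrival), payload))

    def get(self):
        # Always hand out the highest-priority (lowest-level) item waiting.
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)

queue = TransportQueue()
queue.put(2, "track update for secondary backup")
queue.put(1, "track update for primary backup")
queue.put(2, "status report for secondary backup")
while len(queue):
    print(queue.get())
```

Data bound for a primary backup site overtakes anything queued for a secondary site, while items at the same level keep their arrival order.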

REPLICATION  CONFIGURATION  FUNCTIONS 

Given sets N and M, the notation N*M will be used to represent the
subset of the Cartesian product with the diagonal removed, that is,
N*M = N×M - {(c,c) | c in N∩M}, where N∩M is the intersection of N and M.
For instance, if N = {a,b,c} and M = {d,a}, then
N*M = {(a,d),(b,d),(b,a),(c,d),(c,a)}.
For convenience, we will always abbreviate {a}*N to just a*N. A
replication configuration for a system with sites N = {a,b,...} can be
represented as a function, f, from N*N to the positive real numbers, R+.
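
As a small check of the N*M notation, the following Python sketch builds the product with the diagonal removed, reading "diagonal removed" as dropping every pair (c,c). The function name star is an assumption made for this example.

```python
from itertools import product

def star(n_sites, m_sites):
    """Return N*M: all pairs (a, b) with a in N, b in M and a != b."""
    return {(a, b) for a, b in product(n_sites, m_sites) if a != b}

# The example from the text: N = {a,b,c}, M = {d,a}.
pairs = star({"a", "b", "c"}, {"d", "a"})
print(sorted(pairs))

# a*N abbreviates {a}*N: every pair that starts at site a.
print(sorted(star({"a"}, {"a", "b", "c"})))
```

The first call yields the five pairs listed in the text; the pair (a,a) is the only one dropped.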

Given a replication configuration for a system with n sites, N =
{a,b,...}, we will say a function f: N*N -> R+ represents the replication
configuration if for each site, a,

    N_{1,a}, N_{2,a}, N_{3,a}, ...

is the data backup chain for a, where

    N_{1,a} = {b in N | f(a,b) = min(f(a*N))},
    N_{2,a} = {b in N | f(a,b) = min(f(a*(N - N_{1,a})))}, ....

We will say f canonically represents the replication configuration if f
assigns the value 1 to all primary backups, the value 2 to all secondary
backups, etc. More generally, f will be called canonical if f: N*N -> Z+
(positive integers), and f^{-1}(i) is non-empty for each i in
{1,2,...,max(f)} (that is, f assigns each integer between 1 and its max to
at least 1 site).
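
The definitions above can be sketched in Python as follows. The function names backup_chain and canonical, and the one-dimensional site positions, are assumptions made for this example. Ties in f produce co-primary (or co-secondary) sets.

```python
def backup_chain(f, site, sites):
    """Return [N_1, N_2, ...] for `site`: the other sites grouped by rising f."""
    remaining = set(sites) - {site}
    chain = []
    while remaining:
        lowest = min(f(site, b) for b in remaining)
        level = {b for b in remaining if f(site, b) == lowest}
        chain.append(level)
        remaining -= level
    return chain

def canonical(f, sites):
    """Map each pair (a, b) to the level (1, 2, ...) of b in a's chain."""
    table = {}
    for a in sites:
        for level, group in enumerate(backup_chain(f, a, sites), start=1):
            for b in group:
                table[(a, b)] = level
    return table

# Hypothetical sites on a line; f is the distance between positions.
position = {"a": 0, "b": 1, "c": 2, "d": 5}
sites = list(position)

def f(a, b):
    return abs(position[a] - position[b])

print(backup_chain(f, "a", sites))
print(backup_chain(f, "b", sites))   # a and c are co-primary backups for b
print(canonical(f, sites)[("a", "d")])
```

Site b sits midway between a and c, so both land in its primary set and d becomes its secondary backup.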


FACT  1: 


Every  replication  configuration  is  represented  by  one  canonical  function 
and  each  canonical  function  represents  a  unique  replication  configuration. 
Thus,  canonical  functions  completely  characterize  replication 
configurations.  (Recall  that  replication  configurations  with  depths  of  1 
or  2  etc,  can  be  derived  from  these  by  truncation.) 


Fact 1 follows from the consideration that, given a replication
configuration, the function which assigns 1 to all primary data backup
sites, 2 to all secondary data backup sites, 3 to all third level data
backup sites, etc., is canonical and that any canonical function, f,
prescribes a replication configuration where the chain for site, a, is

    N_{1,a}, N_{2,a}, N_{3,a}, ..., N_{max(f),a}

(the sets N_{i,a} are defined above).

When it is necessary to distinguish, N_{f,i,a} and N_{g,i,a} will be used
to refer to the sets generated by the functions f and g respectively (note:
f and g are functions, i is a running index, and a is a site).


FACT  2 


Define the relation R(f,g) if N_{f,i,a} = N_{g,i,a} for every a in N and
every i in {1,2,...}. The relation R is an equivalence relation which
partitions all f: N*N -> R+ into distinct classes such that each class
corresponds to one configuration and contains exactly one canonical
function.


This is a useful fact, for it allows us to define a configuration using
any function (such as distance) without worrying about the explicit
canonical function.
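
To illustrate Fact 2, the sketch below checks whether two functions generate the same backup chains for every site. The names chains and related, and the site coordinates, are assumptions made for this example. Distance and squared distance fall in the same class because squaring preserves order, whereas the reciprocal of distance does not.

```python
import math

def chains(f, sites):
    """Map each site to its backup chain: a list of sets ordered by rising f."""
    result = {}
    for a in sites:
        levels = {}
        for b in sites:
            if b != a:
                levels.setdefault(f(a, b), set()).add(b)
        result[a] = [levels[value] for value in sorted(levels)]
    return result

def related(f, g, sites):
    """R(f,g): f and g generate identical chains for every site."""
    return chains(f, sites) == chains(g, sites)

# Hypothetical coordinates, chosen so that no two distances tie.
pos = {"a": (0, 0), "b": (3, 0), "c": (0, 4), "d": (6, 8)}
sites = list(pos)

def dist(a, b):
    return math.hypot(pos[a][0] - pos[b][0], pos[a][1] - pos[b][1])

def dist_sq(a, b):
    return (pos[a][0] - pos[b][0]) ** 2 + (pos[a][1] - pos[b][1]) ** 2

def recip(a, b):
    return 1.0 / dist(a, b)

print(related(dist, dist_sq, sites))
print(related(dist, recip, sites))
```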


DISTRIBUTION  METHODS 


I.  Physical distance in the theater gives us one replication
configuration. Define f(a,b) = SQRT((a_1-b_1)^2 + (a_2-b_2)^2), where the
position of site a is (a_1,a_2) in some Cartesian coordinate system. This
configuration provides a separate data backup chain for each site.
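
A minimal Python sketch of the physical distance method follows. The coordinates are hypothetical (the report gives none), the names distance and distance_chain are assumptions, and exact ties are assumed not to occur.

```python
import math

# Hypothetical site coordinates in some Cartesian system.
positions = {"a": (0, 0), "b": (1, 0), "c": (5, 1), "d": (2, 3), "e": (6, 5)}

def distance(a, b):
    """f(a,b): Euclidean distance between sites a and b."""
    (ax, ay), (bx, by) = positions[a], positions[b]
    return math.hypot(ax - bx, ay - by)

def distance_chain(site):
    """Other sites ordered nearest first: primary, secondary, ..."""
    others = [b for b in positions if b != site]
    return sorted(others, key=lambda b: distance(site, b))

for site in positions:
    print(site, "->", distance_chain(site))
```

Each site gets its own chain, and the chains generally differ from site to site.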

At  the  back  of  this  report,  the  question  of  data  backup  load  is 
studied  using  analytic  and  statistical  techniques.  From  analytic 
considerations,  one  can  conclude  that  the  expected  number  of  sites  for 
which  a  given  site  will  be  primary  data  backup  is  1  and  the  maximum  number 
of  sites  for  which  a  given  site  will  be  primary  data  backup  is  5.  Using 
Monte  Carlo  techniques,  the  distribution  of  primary  data  backup  loads  was 
calculated.  A  graph  of  the  results  of  this  study  is  included  at  the  back 
of  this  report.  The  program  used  to  produce  these  results  is  included 
following  the  graph. 
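
The load study can be approximated with a short Python sketch. It is not the program used for the report; the function name, the unit-square layout, and the use of Python's random module are assumptions. Each site picks exactly one nearest neighbor, so the loads always sum to the number of sites and the mean load is 1.

```python
import random
from collections import Counter

def primary_load_histogram(n_sites, seed):
    """Count how many sites carry a primary backup load of 0, 1, 2, ..."""
    rng = random.Random(seed)
    points = [(rng.random(), rng.random()) for _ in range(n_sites)]

    def squared_distance(i, j):
        # The square root is skipped: it does not change the ordering.
        return (points[i][0] - points[j][0]) ** 2 + (points[i][1] - points[j][1]) ** 2

    # nearest[i] is the primary data backup site for site i.
    nearest = [
        min((j for j in range(n_sites) if j != i),
            key=lambda j: squared_distance(i, j))
        for i in range(n_sites)
    ]
    load = Counter(nearest)
    # Histogram over all sites, including those with a load of zero.
    return Counter(load.get(i, 0) for i in range(n_sites))

histogram = primary_load_histogram(200, seed=1984)
for load in sorted(histogram):
    print(load, histogram[load])
```

Different seeds give different histograms, but with randomly placed points no site should ever show a load above 5.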


[Figure 1: backup chains for sites a, b, c, d and e, and for all sites,
using physical distance.]


ADVANTAGES  OF  THE  PHYSICAL  DISTANCE  METHOD 


1.  Data  backup  sites  are  close  to  the  backed  up  site  itself.  Therefore, 
for  sites  with  area  based  responsibilities,  the  backup  data  could  be 
expected  to  extend  and  enhance  the  resident  data.  In  that  sense  it 
will  tend  to  enhance  the  information  content  of  the  resident  data. 

2.  It can be expected to be fairly robust under C²-element moves over
short distances, since such moves will only slightly alter the
physical distance basis.

3.  Backup  loads  are  quite  well  distributed  among  the  sites  (see  study  at 
the  back  of  this  report). 


DISADVANTAGES  OF  THE  PHYSICAL  DISTANCE  METHOD: 


1.  The  data  backup  capability  is  vulnerable  to  corridor  attacks,  due  to 
physical  proximity. 


II.  RECIPROCAL OF DISTANCE METHOD: The function is f(a,b) = 1/distance(a,b).


ADVANTAGES: 


1.  It  can  be  expected  to  be  immune  to  corridor  attacks,  since  data 
backups  are  far  removed  from  the  site  which  is  being  backed  up. 


2.  It can be expected to be robust under C²-element moves over short
distances, since such moves will only slightly alter physical distance
and therefore the reciprocal of distance basis.

DISADVANTAGES: 

1.  Backups  are  chosen  in  inverse  relationship  to  proximity.  Thus,  no 
blending  and  enhancement  of  resident  data  can  be  expected. 


[Figure 2: backup chains for sites a, b, c, d and e, and the overall
backup chains pattern, using reciprocal of distance.]
III.  CIRCULAR CHAIN METHOD:

First we enumerate the sites N = {a_1, a_2, ...} according to an
appropriate scheme (distance from a fixed point or random enumeration,
etc.). Then we define one closed chain following that enumeration.
Technically, the function is f(a_i,a_j) = (j - i) mod n.
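
The circular chain function can be sketched in Python as follows, using the enumeration a,c,b,d,e from figure 3. The names circular_f and chain are assumptions made for this example.

```python
def circular_f(enumeration):
    """Build f(a_i, a_j) = (j - i) mod n for the given enumeration."""
    n = len(enumeration)
    index = {site: i for i, site in enumerate(enumeration)}

    def f(a, b):
        return (index[b] - index[a]) % n

    return f

order = ["a", "c", "b", "d", "e"]
f = circular_f(order)

def chain(site):
    """Other sites ordered by rising f(site, b)."""
    others = [b for b in order if b != site]
    return sorted(others, key=lambda b: f(site, b))

for site in order:
    print(site, "->", chain(site))

# Each site is the primary data backup for exactly one other site.
primaries = [chain(site)[0] for site in order]
print(sorted(primaries) == sorted(order))
```

For distinct sites f takes each value from 1 to n-1 exactly once, so f is already canonical and no ties can arise.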

ADVANTAGES: 

1.  The  configuration  is  easily  tuneable  via  the  enumeration  choice. 

2.  The class of configurations provides maximum distribution of backup
loads, in the sense that each site is primary data backup for exactly
one other site, each is secondary data backup for exactly one, etc.

3.  It is quite easily reconfigurable, should the need arise.

4.  The  data  transport  priority  scheme  can  be  much  simpler  under  this 
replication  configuration  method,  since  each  site  could  send  data  to 
the  next  site  on  the  chain  as  it  receives  the  data  itself,  without  the 
necessity  of  considering  other  issues.  Thus,  the  priority  scheme 
might  depend  only  upon  the  nature  of  the  data  itself. 

DISADVANTAGES: 

1.  It  may  prove  too  simplistic  for  future  needs. 


[Figure 3: backup chains for sites a, b, c, d and e, and for all sites,
using enumeration a,c,b,d,e.]


IV.  RANDOM  METHOD: 


The  function,  f(a,b),  would  involve  random  generation  of  a  positive  number 
for  each  pair,  (a,b).  This  could  be  done  allowing  or  disallowing  two 
pairs  to  have  the  same  number. 
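
One way the random method might be realized is sketched below in Python. The function name, the value ranges, and the seeded generator are assumptions made for this example; the report does not prescribe any of them.

```python
import random

def random_configuration(sites, seed, allow_ties):
    """Assign a random positive integer to every ordered pair of distinct sites."""
    rng = random.Random(seed)
    pairs = [(a, b) for a in sites for b in sites if a != b]
    if allow_ties:
        # Values may repeat, so co-primary or co-secondary backups can occur.
        values = [rng.randint(1, len(sites) - 1) for _ in pairs]
    else:
        # A random permutation of 1..len(pairs): no two pairs share a number.
        values = rng.sample(range(1, len(pairs) + 1), len(pairs))
    return dict(zip(pairs, values))

sites = ["a", "b", "c", "d", "e"]
f = random_configuration(sites, seed=7, allow_ties=False)
chain_for_a = sorted((b for b in sites if b != "a"), key=lambda b: f[("a", b)])
print(chain_for_a)
```

Whoever knows the seed can regenerate the configuration, which is one way every site could agree on it without exchanging the whole table.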

ADVANTAGES: 

1.  The  method  would  be  difficult  for  the  enemy  to  decipher. 

DISADVANTAGES: 

1.  There  would  be  no  control  over  such  issues  as  backup  loads,  corridor 
immunity,  etc. 


V.  TRUNCATED  METHODS: 

Only primary, or only primary and secondary, data backup sites are
designated; no further backups are designated. This can be done in
conjunction with any of the above methods.
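
Truncation can be sketched as a filter on a canonical function. In the Python below, the table is built from the circular chain function (j - i) mod n for five sites, and the name truncate is an assumption made for this example.

```python
def truncate(canonical, depth):
    """Keep only backup assignments at levels 1..depth."""
    return {pair: level for pair, level in canonical.items() if level <= depth}

# Canonical table for a circular chain over five sites.
order = ["a", "c", "b", "d", "e"]
n = len(order)
canonical = {
    (order[i], order[j]): (j - i) % n
    for i in range(n) for j in range(n) if i != j
}

primary_and_secondary = truncate(canonical, 2)
print(len(canonical), len(primary_and_secondary))
print(sorted(b for (a, b) in primary_and_secondary if a == "a"))
```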

EVALUATION  OF  ALTERNATIVES: 

Evaluation  of  these  alternatives  is  needed.  An  evaluator  might  involve  a 
simulation  of  the  effects  of  battle  on  data  backup  capability.  It  should 
include  considerations  of  the  following  issues: 


SURVIVABILITY: 

1.  immunity  to  corridor  attacks, 

2.  graceful  degradation  under  attack, 


3.  robustness  under  changes  in  site  position. 

ROBUSTNESS  under  external  change  in  the  system. 

SPEED:  (assuming  fixed  data  loads  and  communication  network  capacities) 

1.  time  necessary  to  provide  full  primary  data  backup  capacity, 

2.  time  necessary  to  provide  full  secondary  data  backup  capacity, 

3.  time  necessary  to  provide  full  data  backup  capacity  at  all  levels. 

FLEXIBILITY: 

1.  capability  for  tuning  and  dynamic  configuring, 

2.  ease  of  tuning  and  dynamic  reconfiguring. 
{*  PROGRAM TO SIMULATE USE OF CLOSEST NEIGHBOR  *}
{*  AS A BACKUP CRITERION                        *}
{*                                               *}
{*  WRITTEN JULY, 1984                           *}
{*  BY DONALD A. VARVEL                          *}


Program Neighbor_Simulation (Input, Output);

Const Max_Points = 200;  { Arbitrary; may be changed }

Type
  Point    = Record X, Y : Real End;
  Neighbor = Record P2 : Integer; Distance : Real End;
  Close    = Array[1..Max_Points] of Neighbor;


Var
  Plotter  : Text;                           { The plotter }
  Layout   : Array[1..Max_Points] of Point;  { The set of randomly-generated points }
  Closest  : Close;                          { Array of closest neighbor and distance }
  Stats    : Array[0..5] of Integer;         { Results of simulation }
  N        : Integer;                        { Actual number of points; input }
  I, J     : Integer;                        { Loop control, subscripts }
  Seed     : Integer4;                       { Random number generator seed }
  HoldDist : Real;                           { Temporary place for distance }
  Count    : Integer;                        { Accumulates occurrences for Stat }
  Present  : Integer;                        { Holds present P2 value for Stat }
  Flag     : Boolean;                        { Flow-of-control crutch }
  Y_OR_N   : Char;                           { Reply from keyboard }


{ Function to compute Cartesian distance   }
{ between two points P1 and P2.  The       }
{ extraction of the root has been          }
{ omitted in the interests of execution    }
{ speed, since it does not affect order.   }

Function GEODESIC(P1, P2 : Point) : Real;
Begin
  GEODESIC := SQR(P1.X-P2.X) + SQR(P1.Y-P2.Y)
End;


{ Generate uniform random numbers in the   }
{ range 0.0 <= R < 1.0                     }

Function RANDOM(Var Seed : Integer4) : Real;
Const BigNum  = 25997;
      Modulus = 32768;
Var
  B : Boolean;
Begin
  Seed := (Seed * BigNum) mod Modulus;
  RANDOM := FLOAT4(Seed) / 32768.0
End;


{ Shell sort, by P2 ascending.  Initial    }
{ distance is chosen by what appears       }
{ to be an arcane method, but which        }
{ gives good results.                      }

Procedure SORT(Var Tosort : Close; N : Integer);
Var
  Dist     : Integer;
  I, J     : Integer;
  TempP2   : Integer;
  TempDist : Real;
Begin
  Dist := TRUNC(EXP(LN(2)*TRUNC(LN(N)/LN(2)))) - 1;
                        { 2**m - 1, where 2**m <= N: }
                        { Known good choice for Shellsort D }
  While Dist > 0 do begin
    For I := 1 to N-Dist do begin
      TempP2 := Tosort[I+Dist].P2;
      TempDist := Tosort[I+Dist].Distance;
      J := I;
      While TempP2 < Tosort[J].P2 do begin
        Tosort[J+Dist].P2 := Tosort[J].P2;
        Tosort[J+Dist].Distance := Tosort[J].Distance;
        J := J - Dist;
        If J < 1 then Break   { Index 0 lies outside Tosort[1..Max_Points] }
      End;  { While TempP2 ... }
      Tosort[J+Dist].P2 := TempP2;
      Tosort[J+Dist].Distance := TempDist
    End;  { For I ... }
    Dist := Dist div 2
  End  { While ... }
End;  { SORT }

Begin  { Main Program }

  { Get input parameters }
  Repeat
    Write('Enter number of points, <= ', Max_Points:1, ' ');
    Readln(N)
  Until N <= Max_Points;
  Write('Enter random number seed. ');
  Readln(Seed);
  Seed := Seed mod 32768;  { Avoid overflow in RANDOM }

  { Random layout }
  For I := 1 to N do begin
    Layout[I].X := RANDOM(Seed);
    Layout[I].Y := RANDOM(Seed)
  End;  { For I ... }

  { Compute closest distances }
  For I := 1 to N do begin
    Closest[I].P2 := 0;
    Closest[I].Distance := 4.0;  { Actual distances < 2 }
    For J := 1 to N do
      If I <> J then begin
        HoldDist := GEODESIC(Layout[I], Layout[J]);
        If HoldDist < Closest[I].Distance then begin
          Closest[I].Distance := HoldDist;
          Closest[I].P2 := J
        End  { then ... }
      End
  End;  { For I ... }

  SORT(Closest, N);  { Sort on P2 }


  { Generate stats }
  For I := 0 to 5 do Stats[I] := 0;
  For I := 1 to N do begin
    Flag := False;
    For J := 1 to N do
      If Closest[J].P2 = I then Flag := True;
    If not Flag then Stats[0] := Stats[0] + 1
  End;
  J := 1;
  While J <= N do begin
    Count := 0;
    Present := Closest[J].P2;
    Repeat
      Count := Count + 1;
      J := J+1
    Until (J > N) or (Closest[J].P2 <> Present);
    Stats[Count] := Stats[Count] + 1
  End;  { While ... }


  { Print stats }
  Writeln; Writeln;
  Writeln('Simulation run with ', N:1, ' points:');
  Writeln('Closest neighbor of    Frequency');
  Writeln('-------------------    ---------');
  For I := 0 to 5 do Writeln(I:1, '    ', Stats[I]);


  { Output options }
  Repeat
    Write('Output to plotter? (Y/N) ');
    Readln(Y_OR_N)
  Until (Y_OR_N = 'Y') or (Y_OR_N = 'y') or (Y_OR_N = 'N') or (Y_OR_N = 'n');


  { Plotter output }
  If (Y_OR_N = 'Y') or (Y_OR_N = 'y') then begin
    Assign(Plotter, 'PRN:');
    Rewrite(Plotter);
    Writeln(Plotter, ';: EH A ');  { Init, small chart, absolute locations }
    Writeln(Plotter, '300,400 D 300,1400 U 300,400 D 2000,400 U ');
                                   { Axes }
    Writeln(Plotter, '300,325 S13 0_ ');
                                   { Label origin }
    For I := 1 to 5 do
      Writeln(Plotter, 'A ', 300+300*I:1, ',', 450:1,
              ' R D 0,-50 U 0,-75 S13 ', I:1, '_ ');
                                   { X tic marks and labels }
    Writeln(Plotter, 'A 350,817 R D -50,0 U -100,0 S13 25_ ');
    Writeln(Plotter, 'A 350,1234 R D -50,0 U -100,0 S13 50_ ');
    Writeln(Plotter, 'A 100,1000 S13 i_ ');
                                   { Y tic marks and labels }
    Writeln(Plotter, 'P2 A ');
    For I := 0 to 5 do
      Write(Plotter, (I+1)*300:1, ',',
            TRUNC(FLOAT(Stats[I]) * (1667.0 / N)) + 400:1, ' M2+4 ');
    Writeln(Plotter);